1 Introduction

In previous notebooks, we have assigned each cell to its condition (0h, 8h, etc.), we have filtered low-quality cells, and we have normalized gene counts to correct for biases such as differences in library size. The result of that is a SingleCellExperiment object that we saved as .RDS and that will be the starting point of this analysis.

Here, we aim to cluster cells to identify each cell type. Hence, we are going to use Seurat, a CRAN package that has become a swiss-knife for scRNA-seq analysis. As described in Kiselev et al, Seurat uses a graph-based clustering algorithm that is scalable to datasets with thousands of cells. Therefore, we will leverage such scalability to cluster >10,000 cells contained in the SCE object.

1.1 Package loading

library(SingleCellExperiment)
library(scater)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(Seurat)
library(cowplot)
library(ggpubr)
library(purrr)
library(tidyverse)

1.2 Source script with functions

source("bin/utils.R")

2 Create seurat object

Seurat uses its own single-cell data container, a so-called Seurat object. Hence, we first need to convert the SCE to this new data structure:

date <- Sys.Date()

# Load SingleCellExperiment
sce <- readRDS("results/R_objects/10X_SingleCellExperiment_filt&norm.RDS")

# To increase interpretability downstream, change rownames from ensembl to gene 
# symbol
rownames(sce) %>% 
  table() %>% 
  sort(decreasing = TRUE) %>% 
  head(10)
## .
##   PNRC2  SRSF10     7SK    A1BG A2M-AS1    AAAS    AACS   AAED1   AAGAB    AAK1 
##       2       2       1       1       1       1       1       1       1       1
ind <- match(c("PNRC2", "SRSF10"), rownames(sce))
rownames(sce)[ind] <- c("PNRC2.1", "SRSF10.1")

# Convert SCE to Seurat
seurat <- as.Seurat(sce)

As we have two donors (male/female), we will annotate them separately.

3 Find Variable Genes

To cluster our cells, we need to overcome 3 challenges:

  1. The ‘curse of dimensionality’: as each cell can be conceived as a vector with >10,000 genes, and as two random cells will have most of each genes equal, the distance measured between any given pair of cells will be very low, thus being unreliable for proper comparisons.
  2. Computational complexity: as the data is highly dimensional, even the most greedy algorithm will take long to complete.
  3. Most genes should not be differentially expressed between cells, so the observed differences in such genes will be due to technical issues or transient biological states, that may confound the true structure in the dataset.

A first approach to tackle these challenges consists of finding the most variable genes in the dataset (feature extraction). That is, to find the subset of genes that drive most of the variability in the expression matrix. Seurat calculates the average expression and dispersion for each gene. Then, it divides genes into bins based on its average, and for each bin computes a z-score per gene. Those genes with a z-score above a certain cutoff are categorized as highly variable. The binning step is vital, since genes with more UMI tend to have more dispersion.

seurat_l <- SplitObject(seurat, split.by = "donor")
seurat_l <- purrr::map(seurat_l, FindVariableFeatures)
purrr::map(seurat_l, ~ length(VariableFeatures(.x)))
## $male
## [1] 2000
## 
## $female
## [1] 2000
purrr::map(seurat_l, VariableFeaturePlot)
## $male

## 
## $female

As we can see, we reduce the number of dimensions from >10,000 genes to 2000 HVG.

4 Scale data

An important pre-processing step in any cluster analysis is to scale the data, as otherwise variables with a higher mean will have a higher weight in the distance metric. We regress out the “batch” variable:

seurat_l <- purrr::map(seurat_l, ScaleData, vars.to.regress = "batch")

5 Linear dimensionality reduction (PCA)

An additional challenge in our cluster analysis is that scRNAs-seq is very noisy (very susceptible to technical artifacts), and very sparse (contains drop-outs). Thus, differences in single genes may not be accurate to identify cell types. To that end, we can perform PCA, in which each PC can be conceived as a ‘metagene’ that includes information across a correlated gene set. Furthermore, we will reduce the dimensionality even more.

seurat_l <- purrr::map(seurat_l, RunPCA)
purrr::map(seurat_l, VizDimLoadings, dims = 1:2)
## $male

## 
## $female

purrr::map(seurat_l, PCHeatmap, dims = 1, balanced = TRUE)

## $male
## NULL
## 
## $female
## NULL

6 Cluster cells

Seurat uses the Louvain algorithm to cluster cells:

seurat_l <- purrr::map(seurat_l, FindNeighbors, dims = 1:15)
seurat_l <- purrr::map(seurat_l, FindClusters, resolution = 0.05)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 7471
## Number of edges: 281204
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9739
## Number of communities: 4
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2893
## Number of edges: 110323
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9781
## Number of communities: 4
## Elapsed time: 0 seconds

7 Non-linear dimensionality reduction (UMAP/tSNE)

We can visualize the former clusters with a t-Stochastic Neighbor Embedding (tSNE) and a Uniform Manifold Approximation and Projection (UMAP), which allow to depict more structure in the data than PCA:

seurat_l <- purrr::map(seurat_l, RunTSNE, dims = 1:15)
seurat_l <- purrr::map(seurat_l, RunUMAP, dims = 1:15)
seurat_l <- purrr::map(seurat_l, function(s) {
  s@reductions$PCA <- NULL
  s@reductions$TSNE <- NULL
  s@reductions$UMAP <- NULL
  s
})
purrr::map(seurat_l, DimPlot, reduction = "tsne")
## $male

## 
## $female

purrr::map(seurat_l, DimPlot, reduction = "umap")
## $male

## 
## $female

As we can see, there are 5 major clusters. Interestingly:

purrr::map(seurat_l, function(s) {
  Idents(s) <- "batch"
  DimPlot(s)
})
## $male

## 
## $female

Regressing out the batch effect with the ScaleData function from above removed the majority of the batch effect.

8 Find differentially expressed genes (cluster biomarkers)

Let us find the markers of each of the clusters above. In other words, let us find which genes are exclusively expressed in each cluster and will help us identify the cell types in our data set:

markers <- purrr::map(seurat_l, FindAllMarkers, only.pos = TRUE)
DT::datatable(markers$male)
DT::datatable(markers$female)
saveRDS(markers, file = "results/R_objects/pbmc_markers_list.rds")
marker_selection <- purrr::map2(seurat_l, markers, function(seurat, df) {
  num_k <- length(unique(seurat$seurat_clusters))
  selection <- unlist(purrr::map(0:(num_k-1), ~df[df$cluster == .x, "gene"][0:8]))
  selection
})
marker_selection <- purrr::map(marker_selection, ~ c("IL7R", .x))
heatmaps_markers <- purrr::map2(seurat_l, marker_selection, ~DoHeatmap(.x, features = .y) + NoLegend())
heatmaps_markers
## $male

## 
## $female

8.1 Assigning cell type identity to clusters

Based on the previously found markers, we can annotate each cluster to known cell types:

8.2 Male

Cluster ID Markers Cell Type
0 IL7R T cells
1 GNLY, NKG7 Natural Killer (NK)
2 LYZ, S100A8 Monocytes
3 MS4A1, CD79B B cells
seurat_l$male$cell_type <- factor(seurat_l$male$seurat_clusters)
levels(seurat_l$male$cell_type) <- c("T", "NK", "Monocyte", "B")

8.3 Female

Cluster ID Markers Cell Type
0 IL7R T cells
1 GNLY, NKG7 Natural Killer (NK)
2 LYZ, S100A8 Monocytes
3 MS4A1, CD79B B cells
seurat_l$female$cell_type <- factor(seurat_l$female$seurat_clusters)
levels(seurat_l$female$cell_type) <- c("T", "NK", "Monocyte", "B")

We can visualize the annotation with all cells pooled together:

cell_type_df <- purrr::map(seurat_l, ~ .x@meta.data[, "cell_type", drop = FALSE])
cell_type_df <- cell_type_df %>% 
  purrr::map(rownames_to_column, var = "barcode") %>% 
  bind_rows(.id = "donor")
rownames(cell_type_df) <- cell_type_df$barcode
cell_type_df <- cell_type_df[colnames(seurat), ]
seurat$cell_type <- cell_type_df$cell_type
seurat <- pre_process_seurat(seurat, vars_to_regress = "batch")
Idents(seurat) <- "cell_type"
umap_annot <- DimPlot(seurat, reduction = "umap", label = TRUE) + NoLegend()
umap_markers <- FeaturePlot(seurat, features = c("IL7R", "GNLY", "LYZ", "MS4A1"), reduction = "umap")
saveRDS(umap_annot, file = "results/R_objects/ggplots/umap_annotation_pbmc.rds")
saveRDS(umap_markers, file = "results/R_objects/ggplots/umap_markers_pbmc.rds")
umap_annot

umap_markers

Let us summarise the number of cells per cell type statified by donor and temperature

qc_cell_type_pbmc <- seurat@meta.data
qc_cell_type_pbmc$temperature <- case_when(
  qc_cell_type_pbmc$condition == "0h" ~ "0h",
  qc_cell_type_pbmc$condition %in% c("2h", "8h", "24h_RT", "48h_RT") ~ "21ºC",
  qc_cell_type_pbmc$condition %in% c("24h_4C", "48h_4C") ~ "4ºC"
)
qc_cell_type_pbmc$time <- qc_cell_type_pbmc$condition %>%
  str_remove("_4C") %>%
  str_remove("_RT")
levels(qc_cell_type_pbmc$cell_type) <- c("T-cell", "NK", "Monocyte", "B-cell")
qc_cell_type_pbmc <- qc_cell_type_pbmc %>%
  mutate(time = factor(time, levels = c("0h", "2h", "8h", "24h", "48h")),
         temperature = factor(temperature, levels = c("0h", "21ºC", "4ºC")),
         donor = factor(donor, levels = c("male", "female"))) %>%
  group_by(time, donor, temperature, cell_type) %>%
  summarise(num_cells = n()) %>%
  ungroup()
colnames(qc_cell_type_pbmc)[4] <- "annotation"
qc_cell_type_pbmc <- add_column(
  qc_cell_type_pbmc,
  experiment = rep("PBMC", nrow(qc_cell_type_pbmc)),
  .after = 0
)
DT::datatable(qc_cell_type_pbmc)
saveRDS(qc_cell_type_pbmc, "results/R_objects/qc_summary_table_cell_type_pbmc.rds")

9 Save

# saveRDS(seurat, "results/R_objects/10X_pbmc_Seurat_clustered.RDS")
# saveRDS(seurat_l, "results/R_objects/10X_pbmc_Seurat_donors_list_clustered.RDS")



# Recode variables
# seurat <- colData(sce2)[, c("batch", "donor", "ident", "condition")]
# conds <- c("0h", "2h", "8h", "24h_RT", "48h_RT", "24h_4C", "48h_4C")
# sce2$condition <- factor(sce2$condition, levels = conds)
# 
# levels(sce2$condition) <- conds %>% 
#   str_remove("RT") %>% 
#   str_remove("_")
# sce2$temperature <- case_when(
#   sce2$condition == "0h" ~ "gold",
#   sce2$condition %in% c("2h", "8h", "24h", "24hBioabank", "48h") ~ "room temperature",
#   sce2$condition %in% c("24h4C", "48h4C") ~ "4ºC"
# )
# sce2$condition <- str_remove(as.character(sce2$condition), "4C")
# colnames(colData(sce2)) <- c("batch", "sex", "cell_type", "time", "temperature")
# 
# # Recode rowData variables
# rowData(sce2) <- rowData(sce)
# 
# # Save as RDS
# saveRDS(sce2, "results/R_objects/10X_SingleCellExperiment_clustered.RDS")

10 Session Info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.0               stringr_1.4.0               dplyr_0.8.4                 readr_1.3.1                 tidyr_1.0.2                 tibble_2.1.3                tidyverse_1.3.0             purrr_0.3.3                 ggpubr_0.2.5                magrittr_1.5                cowplot_1.0.0               Seurat_3.1.4                org.Hs.eg.db_3.10.0         AnnotationDbi_1.48.0        scater_1.14.6               ggplot2_3.3.0               SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1 DelayedArray_0.12.2         BiocParallel_1.20.1         matrixStats_0.55.0          Biobase_2.46.0              GenomicRanges_1.38.0        GenomeInfoDb_1.22.0         IRanges_2.20.2              S4Vectors_0.24.3            BiocGenerics_0.32.0         BiocStyle_2.14.4           
## 
## loaded via a namespace (and not attached):
##   [1] reticulate_1.14          tidyselect_1.0.0         RSQLite_2.2.0            htmlwidgets_1.5.1        grid_3.6.1               Rtsne_0.15               munsell_0.5.0            codetools_0.2-16         mutoss_0.1-12            ica_1.0-2                DT_0.12                  future_1.16.0            withr_2.1.2              colorspace_1.4-1         knitr_1.28               rstudioapi_0.11          ROCR_1.0-7               ggsignif_0.6.0           gbRd_0.4-11              listenv_0.8.0            Rdpack_0.11-1            labeling_0.3             GenomeInfoDbData_1.2.2   mnormt_1.5-6             bit64_0.9-7              farver_2.0.3             vctrs_0.2.3              generics_0.0.2           TH.data_1.0-10           xfun_0.12                R6_2.4.1                 ggbeeswarm_0.6.0         rsvd_1.0.3               bitops_1.0-6             assertthat_0.2.1         promises_1.1.0           scales_1.1.0             multcomp_1.4-12          beeswarm_0.2.3           gtable_0.3.0             npsurv_0.4-0             globals_0.12.5           sandwich_2.5-1           rlang_0.4.5              splines_3.6.1            lazyeval_0.2.2           broom_0.5.5             
##  [48] BiocManager_1.30.10      yaml_2.2.1               reshape2_1.4.3           modelr_0.1.6             crosstalk_1.0.0          backports_1.1.5          httpuv_1.5.2             tools_3.6.1              bookdown_0.18            gplots_3.0.3             RColorBrewer_1.1-2       ggridges_0.5.2           TFisher_0.2.0            Rcpp_1.0.3               plyr_1.8.6               zlibbioc_1.32.0          RCurl_1.98-1.1           pbapply_1.4-2            viridis_0.5.1            zoo_1.8-7                haven_2.2.0              ggrepel_0.8.1            cluster_2.1.0            fs_1.3.2                 data.table_1.12.8        RSpectra_0.16-0          magick_2.3               lmtest_0.9-37            reprex_0.3.0             RANN_2.6.1               mvtnorm_1.1-0            fitdistrplus_1.0-14      xtable_1.8-4             mime_0.9                 hms_0.5.3                patchwork_1.0.0          lsei_1.2-0               evaluate_0.14            readxl_1.3.1             gridExtra_2.3            compiler_3.6.1           KernSmooth_2.23-16       crayon_1.3.4             htmltools_0.4.0          later_1.0.0              RcppParallel_4.4.4       lubridate_1.7.4         
##  [95] DBI_1.1.0                dbplyr_1.4.2             MASS_7.3-51.5            Matrix_1.2-18            cli_2.0.2                gdata_2.18.0             metap_1.3                igraph_1.2.4.2           pkgconfig_2.0.3          sn_1.5-5                 numDeriv_2016.8-1.1      plotly_4.9.2             xml2_1.2.2               vipor_0.4.5              multtest_2.42.0          XVector_0.26.0           bibtex_0.4.2.2           rvest_0.3.5              digest_0.6.25            sctransform_0.2.1        RcppAnnoy_0.0.15         tsne_0.1-3               rmarkdown_2.1            cellranger_1.1.0         leiden_0.3.3             uwot_0.1.5               DelayedMatrixStats_1.8.0 shiny_1.4.0              gtools_3.8.1             lifecycle_0.1.0          nlme_3.1-145             jsonlite_1.6.1           BiocNeighbors_1.4.2      viridisLite_0.3.0        fansi_0.4.1              pillar_1.4.3             lattice_0.20-40          fastmap_1.0.1            httr_1.4.1               plotrix_3.7-7            survival_3.1-8           glue_1.3.1               png_0.1-7                bit_1.1-15.2             stringi_1.4.6            blob_1.2.1               BiocSingular_1.2.2      
## [142] caTools_1.18.0           memoise_1.1.0            irlba_2.3.3              future.apply_1.4.0       ape_5.3